Abstract 

Diffusion and propagation of information, in- 
fluence and diseases take place over increas- 
ingly larger networks. We observe when a node 
copies information, makes a decision or be- 
comes infected but networks are often hidden 
or unobserved. Since networks are highly dy- 
namic, changing and growing rapidly, we only 
observe a relatively small set of cascades be- 
fore a network changes significantly. Scalable 
network inference based on a small cascade set 
is then necessary for understanding the rapidly 
evolving dynamics that govern diffusion. In 
this article, we develop a scalable approxima- 
tion algorithm with provable near-optimal per- 
formance based on submodular maximization 
which achieves high accuracy in such scenarios, 
solving an open problem first introduced 
by Gomez-Rodriguez et al. (2010). Experiments 
on synthetic and real diffusion data show that our 
algorithm in practice achieves an optimal trade- 
off between accuracy and running time. 



1. Introduction 

Over the last years, there has been an increasing interest in understanding diffusion and propagation processes in a broad range of domains: information propagation (Gomez-Rodriguez et al., 2010), social networks (Kempe et al., 2003), viral marketing (Watts & Dodds, 2007), epidemiology (Wallinga & Teunis, 2004) and human travels (Brockmann et al., 2006).

In the context of diffusion networks, one of the 
fundamental research problems is how to infer 
the connectivity of a network based on diffusion traces (Gomez-Rodriguez et al., 2010; 2011; Myers & Leskovec, 2010; Snowsill et al., 2011). In 
information propagation, we note when a blog or news 
site writes about a piece of information. However, in 
many cases, the blogger or journalist does not link to her 
source and therefore we do not know where she gathered 
the information from. In viral marketing, we get to know 
when customers buy products or subscribe to services, but 
typically cannot observe the trendsetters who influenced 
customers' decisions. Finally, in epidemiology, we can 
observe when a person gets ill but cannot tell who infected 
her. In all these scenarios, we observe spatiotemporal 
traces of information spread (be it in the form of a meme, 
a decision, or a virus) but we do not know the paths over 
which information propagates. We note where and when 
information emerges but not how or why it does. In this 
context, inferring the connectivity of diffusion networks is 
essential to reconstruct and predict the paths over which 
information spreads, maximize sales of a product or stop 
infections. 

Our approach to network inference. We consider that 
on a fixed hypothetical network, diffusion processes prop- 
agate as directed trees through the network. Since we only 
observe the times when nodes are reached by a diffusion 
process, there are many possible propagation trees that ex- 
plain a set of cascades. Naive computation of the model 
takes exponential time since there is a combinatorially large 
number of propagation trees. It has been shown that computations over this super-exponential set of trees can be performed in cubic time (Gomez-Rodriguez et al., 2010). However, to the best of our knowledge, efficient optimization of the model has been an open question to date. Here, we show that computation over the super-exponential set of trees can indeed be performed in quadratic time and, surprisingly, that the resulting objective function is submodular. Exploiting this natural diminishing returns property, we can efficiently optimize the objective function to find a near-optimal network, with provable guarantees, that best explains the observed cascades. Lazy evaluation and the local structure of the problem can be used to speed up our method. Considering all possible propagation trees enables us to learn a network from fewer observed cascades. This is important since social networks are highly dynamic (Backstrom & Leskovec, 2011), changing and growing rapidly, and we can only expect to record a small number of cascades over a fixed network. 

Related work. The work most closely related to ours (Gomez-Rodriguez et al., 2010; 2011; Myers & Leskovec, 2010) also uses a generative probabilistic model for inferring diffusion networks. NetInf (Gomez-Rodriguez et al., 2010) infers the network connectivity using submodular optimization by considering only the most probable directed tree supported by each cascade. NetRate (Gomez-Rodriguez et al., 2011) and ConNIe (Myers & Leskovec, 2010) infer not only the network connectivity but either prior probabilities of infection or transmission rates of infection using convex optimization by considering all possible directed trees supported by each cascade. 

The main innovation of this paper is to tackle the network inference problem as a submodular maximization problem in which we consider not only the most probable directed tree, as in NetInf, but all directed trees supported by each cascade, as in ConNIe and NetRate. By considering all trees, we are able to infer a network more accurately than NetInf when the number of observed cascades is small compared to the network size, and by using the greedy algorithm for submodular maximization instead of convex optimization, we are several orders of magnitude faster than ConNIe and NetRate. Therefore, we present a network inference algorithm that may be capable of inferring real networks on the order of hundreds of thousands of nodes from a small number of observed cascades. This comes with a drawback: our algorithm infers neither prior probabilities of infection nor transmission rates, but only the network connectivity. However, in practice, our algorithm provides a measure of importance for each edge of the network through the marginal gain that each edge provides. 

Inferring how diffusion propagates over rapidly changing 
networks is crucial for a better understanding of the dynam- 
ics that govern processes taking place over information and 
social networks. In this context, scalability is a key point 
given the increasingly larger size of such networks and cas- 
cade data. 

The remainder of the paper is organized as follows: in Section 2 we describe the model of diffusion and state the network inference problem. Section 3 presents an efficient approximation algorithm with provable near-optimal performance. Section 4 evaluates our method using synthetic and real data, and we conclude with a discussion of our results in Section 5. 

[Figure 1 panels: (a) Cascade c on G; (b) Spanning trees induced by cascade c on G]

Figure 1. Panel (a) shows a cascade $\mathbf{t} = \{t_1, \ldots, t_5\}$ on network $G$, where $t_{i-1} < t_i$. Panel (b) shows all connected spanning trees induced by cascade $\mathbf{t}$ on $G$, i.e., all possible ways in which a diffusion process spreading over $G$ can create the cascade. 



2. Problem formulation 

In this section, we first describe the diffusion data our algorithm is designed for and then revisit the generative model of diffusion introduced recently by Gomez-Rodriguez et al. (2010). We conclude with a statement of the network inference problem. 

Data. We observe a set $C$ of cascades $\{\mathbf{t}^1, \ldots, \mathbf{t}^{|C|}\}$ on a fixed population of $N$ nodes. A cascade $\mathbf{t}^c := (t_1^c, \ldots, t_N^c)$ is simply an $N$-dimensional vector recording when nodes in the population get infected. We only observe the time $t_i^c$ when a node $i$ got infected, but not who infected the node or why it got infected. In each cascade, there are typically nodes that are never infected; their infection times are taken to be arbitrarily long. We assume there is an underlying unobserved network $G$ that nodes in the population belong to, and cascades propagate over it. Our aim is to discover this unknown network over which cascades originally propagated by using only the recorded infection times. 

Pairwise transmission likelihood. We assume node $j$ can infect node $i$ with prior probability of transmission $\beta$. Now, consider that node $j$ gets infected at time $t_j$ and succeeds at infecting node $i$ at time $t_i$. We then assume that the infection time $t_i$ depends on $t_j$ through a pairwise transmission likelihood $f(t_i \mid t_j; \alpha_{j,i})$. As in previous studies of information propagation (Gomez-Rodriguez et al., 2010; 2011) and epidemiology (Wallinga & Teunis, 2004), we consider two well-known monotonic parametric models: exponential, $f(t_i \mid t_j; \alpha_{j,i}) \propto e^{-\alpha_{j,i}(t_i - t_j)}$, and power-law, $f(t_i \mid t_j; \alpha_{j,i}) \propto (t_i - t_j)^{-1-\alpha_{j,i}}$, and one non-monotonic parametric model: Rayleigh, $f(t_i \mid t_j; \alpha_{j,i}) \propto (t_i - t_j)\, e^{-\alpha_{j,i}(t_i - t_j)^2}$. Although we perform experiments in networks in which the transmission rate $\alpha_{j,i}$ of each edge can be different, in the remainder of the paper, for simplicity, we assume all transmission rates to be equal, $\alpha_{j,i} = \alpha$. Importantly, our algorithm does not depend on the particular choice of pairwise transmission likelihood, and choosing more complicated parametric or non-parametric likelihoods does not increase its computational complexity. 
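The three parametric likelihoods can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, normalization constants are dropped (the models are given only up to proportionality), and returning zero when the child is not infected strictly after the parent is an assumption that encodes time ordering.

```python
import math

def exponential(t_i, t_j, alpha=1.0):
    """Unnormalized exponential likelihood that j (infected at t_j) infects i at t_i."""
    dt = t_i - t_j
    return math.exp(-alpha * dt) if dt > 0 else 0.0

def power_law(t_i, t_j, alpha=1.0):
    """Unnormalized power-law likelihood; decays polynomially in the time gap."""
    dt = t_i - t_j
    return dt ** (-1.0 - alpha) if dt > 0 else 0.0

def rayleigh(t_i, t_j, alpha=1.0):
    """Unnormalized Rayleigh likelihood; rises and then falls with the time gap."""
    dt = t_i - t_j
    return dt * math.exp(-alpha * dt ** 2) if dt > 0 else 0.0

# The first two models are monotonic in the gap; Rayleigh peaks at an intermediate gap.
gaps = [0.1, 0.7, 3.0]
rayleigh_values = [rayleigh(g, 0.0) for g in gaps]
```

Because the inference algorithm only ever consumes the value of the likelihood on an edge, any of these functions (or a non-parametric one) could be passed around as a plain callable.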

Likelihood of a cascade for a given tree. We assume that diffusion processes propagate as directed trees, i.e., a node gets infected by the action of a single node, its parent. Then, for a given tree $T$ and cascade $\mathbf{t}^c$, we can compute the likelihood of the cascade given the tree as follows: 

$$f(\mathbf{t}^c \mid T) = \prod_{(u,v) \in E_T} f(t_v \mid t_u; \alpha), \qquad (1)$$

where $E_T$ is the edge set of tree $T$. Considering a specific tree $T$ for a cascade $\mathbf{t}^c$ amounts to fixing which edges have successfully spread the information. Therefore, given the tree $T$, we can compute the likelihood of the infection times of the nodes in the cascade $\mathbf{t}^c$ simply by using the pairwise transmission likelihood of each edge of the tree. 

Probability of a tree in a given network. In order to compute the likelihood of a cascade $\mathbf{t}^c$ for a given tree $T$, we have considered the tree $T$ to be given. We now compute the probability of a tree $T$ in a network $G$ as follows: 

$$P(T \mid G) = \prod_{(u,v) \in E_T} \beta \prod_{u \in V_T,\, (u,x) \in E \setminus E_T} (1 - \beta),$$

where $V_T$ is the vertex set of tree $T$, $E_T$ is the edge set of tree $T$, $E$ is the edge set of the network $G$ and $q = |E_T| = |V_T| - 1$ is the number of edges in $T$, which counts the edges over which the diffusion process successfully propagated. For a particular cascade $\mathbf{t}^c$ and tree $T$, $V_T$ is the set of nodes that belong to $\mathbf{t}^c$, i.e., nodes where the infection time $t_i < \infty$. The first product accounts for the active edges in $G$, i.e., edges that define the tree $T$, and the second product accounts for the inactive edges in $G$, i.e., edges where the diffusion process did not spread. For simplicity, we assume the same prior probability of transmission $\beta$ for every edge of the network $G$. 

Likelihood of a cascade in a given network. Now, for a cascade $\mathbf{t}^c$, we consider all possible propagation trees $T$ that are supported by the network $G$, i.e., all possible ways in which a diffusion process spreading over $G$ can create cascade $\mathbf{t}^c$: 

$$f(\mathbf{t}^c \mid G) = \sum_{T \in \mathcal{T}_c(G)} f(\mathbf{t}^c \mid T)\, P(T \mid G), \qquad (2)$$

where $\mathbf{t}^c$ is a cascade and $\mathcal{T}_c(G)$ is the set of all the directed connected spanning trees on the subnetwork of $G$ induced by the nodes that got infected in cascade $\mathbf{t}^c$, i.e., $t_i \in \mathbf{t}^c : t_i < \infty$. Figure 1 illustrates the notion of a cascade and all the connected spanning trees $T$ induced by its nodes. 

All trees $T \in \mathcal{T}_c(G)$ employ the same vertex set $V_T$ and $P(T \mid G)$ depends only on the size of the vertex set $V_T$. Therefore, assuming the same prior probability of transmission $\beta$ for every edge of the network, $P(T \mid G)$ is equal for all trees $T$ on the subnetwork of $G$ induced by the nodes that got infected in cascade $\mathbf{t}^c$ and we simplify Eq. 2: 

$$f(\mathbf{t}^c \mid G) \propto \sum_{T \in \mathcal{T}_c(G)} \prod_{(u,v) \in E_T} f(t_v \mid t_u; \alpha). \qquad (3)$$
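As a small worked illustration (the node labels $a$, $b$, $c$ are hypothetical and not from the paper), take a cascade with $t_a < t_b < t_c$ on a network that contains the edges $(a,b)$, $(a,c)$ and $(b,c)$. Only two spanning trees are compatible with the time ordering, and the sum over them factorizes node by node:

```latex
\begin{align*}
f(\mathbf{t}^c \mid G) &\propto
  \underbrace{f(t_b \mid t_a;\alpha)\, f(t_c \mid t_a;\alpha)}_{T_1 = \{(a,b),\,(a,c)\}}
  + \underbrace{f(t_b \mid t_a;\alpha)\, f(t_c \mid t_b;\alpha)}_{T_2 = \{(a,b),\,(b,c)\}} \\
&= f(t_b \mid t_a;\alpha)\,\bigl[\, f(t_c \mid t_a;\alpha) + f(t_c \mid t_b;\alpha) \,\bigr].
\end{align*}
```

Each infected node other than the first contributes one factor that sums over its possible earlier parents, so the number of terms to evaluate grows with the number of edges rather than with the number of trees.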

Now, assuming conditional independence between cascades given the network $G$, we compute the joint likelihood of a set $C$ of cascades occurring in the network $G$ as follows: 

$$f(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G) = \prod_{\mathbf{t}^c \in C} f(\mathbf{t}^c \mid G). \qquad (4)$$



Network inference problem. Given a set of cascades $\{\mathbf{t}^1, \ldots, \mathbf{t}^{|C|}\}$, a prior probability of transmission $\beta$ and a pairwise transmission likelihood $f(t_v \mid t_u; \alpha)$, we aim to find the network $\hat{G}$ such that 

$$\hat{G} = \arg\max_{|G| \leq k} f(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G), \qquad (5)$$

where the maximization is over all directed networks $G$ of at most $k$ edges. 

3. Proposed algorithm 

To the best of our knowledge, the optimization problem defined by Eq. 5 has been considered intractable in the past and proposed as an interesting open problem (Gomez-Rodriguez et al., 2010). We now show how to efficiently find a solution with provable sub-optimality guarantees by exploiting a natural diminishing returns property of the network inference problem: submodularity. 

To evaluate Eq. 4, we need to compute Eq. 3 for each cascade $\mathbf{t}^c$, i.e., compute a sum of likelihoods over all possible connected spanning trees $T$ induced by the nodes infected in each cascade. Although the number of trees can be super-exponential in the number of nodes in the cascade $\mathbf{t}^c$, this super-exponential sum can be performed in time polynomial in the number $n$ of nodes in $\mathbf{t}^c$, by applying Kirchhoff's matrix tree theorem: 

Theorem 1 (Tutte (1948)). Given a directed graph $W$ with nonnegative edge weights $w_{i,j}$, construct a matrix $A$ such that $a_{i,j} = \sum_k w_{k,j}$ if $i = j$ and $a_{i,j} = -w_{i,j}$ if $i \neq j$, and denote the matrix created by removing any row $x$ and column $y$ from $A$ as $A_{x,y}$. Then, 

$$(-1)^{x+y} \det(A_{x,y}) = \sum_{T \in \mathcal{T}(W)} \prod_{(i,j) \in T} w_{i,j}, \qquad (6)$$



where T is each directed spanning tree in W that starts in 

Algorithm 1 Our network inference algorithm 
Require: C, k 
  G ← K; 
  while |G| < k do 
    for all (j, i) ∉ G : ∃ t^c ∈ C with t_j < t_i do 
      δ_{j,i} = 0, M_{j,i} ← ∅; 
      for all t^c : t_j < t_i do 
        w_c(m, n) ← weight of (m, n) in G ∪ {(j, i)}; 
        for all m : t_m < t_i, m ≠ j do 
          δ_{c,j,i} = δ_{c,j,i} + w_c(m, i); 
        end for 
        δ_{j,i} = δ_{j,i} + log(δ_{c,j,i} + w_c(j, i)) − log(δ_{c,j,i} + 1) 
      end for 
    end for 
    (j*, i*) ← argmax_{(j,i) ∉ G} δ_{j,i}; 
    G ← G ∪ {(j*, i*)}; 
  end while 
  return G; 



In our case, we compute Eq. 3 by setting $w_{i,j}$ to $f(t_j \mid t_i; \alpha)$ and computing the determinant in Eq. 6. We then compute Eq. 4 by multiplying the determinants of $|C|$ matrices, one for each cascade. For a fixed cascade $\mathbf{t}^c$, edges with positive weights form a directed acyclic graph (DAG) (only edges such that $t_i < t_j$ have positive weights) and a DAG with a time-ordered labeling of its nodes has an upper triangular connectivity matrix. Thus, the matrix $A_{x,y}$ of Theorem 1 is, by construction, also upper triangular. Fortunately, the determinant of an upper triangular matrix is simply the product of the elements of its diagonal and then, 

$$f(\mathbf{t}^c \mid G) \propto \prod_{t_j \in \mathbf{t}^c} \; \sum_{t_i \in \mathbf{t}^c : t_i < t_j} f(t_j \mid t_i; \alpha).$$

This means that instead of using super-exponential time, we are now able to evaluate Eq. 4 in time $O(|C| \cdot N^2)$, where $N$ is the size of the largest cascade, i.e., the time required to build $A_{x,y}$ and compute the determinant for each of the $|C|$ cascades. 
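The equivalence between the sum over trees, the determinant, and the product of per-node sums can be checked numerically on a toy cascade. The sketch below is illustrative only: the function names, the four-node cascade and the exponential likelihood are assumptions, and the first infected node is treated as the root, so the product runs over the remaining nodes.

```python
import itertools
import math

def f_exp(t_child, t_parent, alpha=1.0):
    """Unnormalized exponential transmission likelihood (illustrative choice)."""
    return math.exp(-alpha * (t_child - t_parent))

def earlier_parents(times, edges, order):
    """For each non-root node v, the earlier-infected nodes u with (u, v) in the graph."""
    return [[u for u in order if times[u] < times[v] and (u, v) in edges]
            for v in order[1:]]

def tree_sum_bruteforce(times, edges, f=f_exp):
    """Enumerate every tree (one earlier parent per non-root node) and sum edge products."""
    order = sorted(times, key=times.get)
    total = 0.0
    for choice in itertools.product(*earlier_parents(times, edges, order)):
        total += math.prod(f(times[v], times[u]) for u, v in zip(choice, order[1:]))
    return total

def tree_sum_product(times, edges, f=f_exp):
    """Closed form: product over non-root nodes of the sum over their earlier parents."""
    order = sorted(times, key=times.get)
    total = 1.0
    for v, parents in zip(order[1:], earlier_parents(times, edges, order)):
        total *= sum(f(times[v], times[u]) for u in parents)
    return total

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[p][c] == 0.0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= factor * m[c][k]
    return d

def tree_sum_determinant(times, edges, f=f_exp):
    """Matrix tree theorem: build A in time order, drop the root's row and column."""
    order = sorted(times, key=times.get)
    n = len(order)
    w = [[f(times[order[j]], times[order[i]])
          if times[order[i]] < times[order[j]] and (order[i], order[j]) in edges else 0.0
          for j in range(n)] for i in range(n)]
    a = [[sum(w[k][j] for k in range(n)) if i == j else -w[i][j]
          for j in range(n)] for i in range(n)]
    return det([row[1:] for row in a[1:]])  # x = y = root index, so the sign is +1

# Toy cascade; the edge ("d", "a") runs against time and therefore carries zero weight.
times = {"a": 0.0, "b": 1.0, "c": 2.5, "d": 3.0}
edges = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "a")}
brute = tree_sum_bruteforce(times, edges)
```

On this example the brute-force enumeration visits four trees, while the closed form touches each candidate edge once, which is where the quadratic per-cascade cost comes from.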

Until now, we have ignored the role of missed infections (Sadikov et al., 2011) or external sources such as mass media (Katz & Lazarsfeld, 1955; Watts & Dodds, 2007) that can produce disconnected cascades. To overcome this point, we consider an additional node $m$ that represents an external source that can infect any node $u$ in a cascade. Therefore, we connect the external influence source $m$ to every other node $u$ with an $\varepsilon$-edge. Every node $u$ can get infected by the external source $m$ with an arbitrarily small probability $\varepsilon$. It is important to remark that adding the external source results in a tradeoff between false positives and false negatives when detecting cascades. The higher the value of $\varepsilon$, the larger the number of nodes that are assumed to be infected by an external source. 

Putting it all together, we include the additional node $m$ in every cascade $\mathbf{t}^c$ and we set the likelihood of a diffusion process spreading from $m$ to any node $j$ in the cascade $\mathbf{t}^c$ to $\varepsilon$. We assume that $\varepsilon < f(t_j \mid t_i; \alpha)$ for any $i, j$. We then define the improvement of log-likelihood for cascade $\mathbf{t}^c$ under graph $G$ over an empty graph $K$: 

$$F(\mathbf{t}^c \mid G) = \sum_{t_j \in \mathbf{t}^c} \log \left( \sum_{t_i \in \mathbf{t}^c : t_i < t_j} w_c(i,j) \right),$$

where $w_c(i,j) = \varepsilon^{-1} f(t_j \mid t_i; \alpha) \geq 0$ for all natural likelihoods, $\sum_{i : t_i < t_j} w_c(i,j) \geq 1$, and we assume that the $\varepsilon$-edges between $m$ and all nodes in the cascade $\mathbf{t}^c$ exist also for the empty graph $K$. 
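A minimal sketch of the per-cascade score follows, assuming an exponential likelihood and a cascade given as a dictionary from infected node to infection time (both are illustrative choices, and the names are hypothetical). The external source $m$ is modeled implicitly: its $\varepsilon$-edge has normalized weight $\varepsilon / \varepsilon = 1$ for every node, which is why the empty graph scores zero.

```python
import math

def f_exp(t_child, t_parent, alpha=1.0):
    """Unnormalized exponential transmission likelihood (illustrative choice)."""
    return math.exp(-alpha * (t_child - t_parent))

def cascade_score(times, edges, eps=1e-3, f=f_exp):
    """F(t^c | G): log-likelihood improvement of one cascade over the empty graph."""
    score = 0.0
    for j, t_j in times.items():
        total = 1.0  # normalized weight of the eps-edge from the external source m
        for i, t_i in times.items():
            if t_i < t_j and (i, j) in edges:
                total += f(t_j, t_i) / eps  # w_c(i, j) = f / eps
        score += math.log(total)
    return score

times = {"a": 0.0, "b": 1.0, "c": 2.0}
empty = cascade_score(times, set())
one_edge = cascade_score(times, {("a", "b")})
two_edges = cascade_score(times, {("a", "b"), ("b", "c")})
```

Adding an edge can only enlarge one inner sum, so the score never decreases as the graph grows; smaller values of `eps` make each real edge count for more relative to the external source.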

Finally, maximizing Eq. 5 is equivalent to maximizing the following objective function: 

$$F_C(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G) = \sum_{\mathbf{t}^c \in C} F(\mathbf{t}^c \mid G), \qquad (8)$$

where $G$ is the variable. 



Efficient optimization. By construction, the empty graph $K$ has score 0, $F_C(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid K) = 0$, and the objective function $F_C$ is non-negative and monotonic, $F_C(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G) \leq F_C(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G')$ for any $G \subseteq G'$. Therefore, adding more edges to $G$ never decreases the solution quality, and thus the complete graph maximizes $F_C$. However, in real-world scenarios, we are interested in inferring sparse graphs with a small number of edges. Thus, we would like to solve: 

$$G^* = \arg\max_{|G| \leq k} F_C(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G), \qquad (9)$$

where the maximization is over all directed networks $G$ of at most $k$ edges. Naively searching over all $k$-edge graphs would take time exponential in $k$, which is intractable. Moreover, finding the optimal solution to Eq. 9 is NP-hard: 

Theorem 2. The diffusion network inference problem defined by Eq. 9 is NP-hard. 

Proof. By reduction from the MAX-k-COVER problem (Khuller et al., 1999). □ 



While finding the optimal solution is hard, we will now 
show that Fc satisfies submodularity on the set of directed 
edges in G, a natural diminishing returns property, which 
will allow us to efficiently find a provable near-optimal so- 
lution to the optimization problem. 

Figure 2. Panels (a-c) plot precision against recall (PR); panels (d-f) plot accuracy. To control the solution sparsity or precision-recall tradeoff, we sweep over k (number of edges) in our method and NetInf and over p (penalty factor) in ConNIe. NetRate has no tunable parameters and therefore outputs a unique solution. (a,d): 1,024 node random Kronecker network with Rayleigh (Ray) model. (b,e): 1,024 node hierarchical Kronecker network with power-law (Pow) model. (c,f): 1,024 node core-periphery Kronecker network with exponential (Exp) model. In all three networks, we recorded 200 cascades. 

A set function $F : 2^W \to \mathbb{R}$ mapping subsets of a finite set $W$ to the real numbers is submodular if, whenever $A \subseteq B \subseteq W$ and $s \in W \setminus B$, it holds that $F(A \cup \{s\}) - F(A) \geq F(B \cup \{s\}) - F(B)$, i.e., adding $s$ to the set $A$ increases the score at least as much as adding $s$ to the set $B$. We have the following result: 

Theorem 3. Let $V$ be a set of nodes, and $C$ be a collection of cascades hitting the nodes $V$. Then $F_C(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G)$ is a submodular function $F_C : 2^W \to \mathbb{R}$ defined over subsets $W \subseteq V \times V$ of directed edges. 

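Proof. Fix a cascade $\mathbf{t}^c$, graphs $G \subseteq G'$ and an edge $e = (r, s)$ not contained in $G'$. We will show that $F(\mathbf{t}^c \mid G \cup \{e\}) - F(\mathbf{t}^c \mid G) \geq F(\mathbf{t}^c \mid G' \cup \{e\}) - F(\mathbf{t}^c \mid G')$. Let $w_{i,j}$ be the weight of edge $(i,j)$ in $G$, and $w'_{i,j}$ in $G'$. Since $G \subseteq G'$, it holds that $w'_{i,j} \geq w_{i,j} \geq 0$. If $(i,j)$ is contained in $G$ and $G'$, then $w_{i,j} = w'_{i,j}$. Let $T_{A,e} = \sum_{i \in A \setminus \{r\} : t_i < t_s} w_c(i, s)$. It holds that $T_{G',e} \geq T_{G,e}$. Hence, 

$$F(\mathbf{t}^c \mid G \cup \{e\}) - F(\mathbf{t}^c \mid G) = \log \frac{T_{G,e} + w_c(r,s)}{T_{G,e}} \geq \log \frac{T_{G',e} + w_c(r,s)}{T_{G',e}} = F(\mathbf{t}^c \mid G' \cup \{e\}) - F(\mathbf{t}^c \mid G'),$$

proving submodularity of $F(\mathbf{t}^c \mid G)$. Now, since nonnegative linear combinations of submodular functions are submodular, the function 

$$F_C(\mathbf{t}^1, \ldots, \mathbf{t}^{|C|} \mid G) = \sum_{\mathbf{t}^c \in C} F(\mathbf{t}^c \mid G)$$

is submodular as well. 

□ 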
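The diminishing returns inequality can also be checked exhaustively on a tiny ground set. The sketch below is a sanity check under stated assumptions, not part of the proof: the helper names and the edge weights are hypothetical, and the test function mimics a single node's term, the logarithm of one (the external source) plus the weights of the selected incoming edges.

```python
import itertools
import math

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)

def is_submodular(F, ground, tol=1e-12):
    """Check F(A + s) - F(A) >= F(B + s) - F(B) for all A <= B and s outside B."""
    ground = frozenset(ground)
    for B in subsets(ground):
        for A in subsets(B):
            for s in ground - B:
                if F(A | {s}) - F(A) < F(B | {s}) - F(B) - tol:
                    return False
    return True

# Hypothetical normalized weights w_c(i, d) of three candidate edges into node d.
weights = {("a", "d"): 2.0, ("b", "d"): 0.5, ("c", "d"): 3.0}

def node_term(selected):
    """log(1 + sum of selected incoming weights), as in one node's contribution."""
    return math.log(1.0 + sum(weights[e] for e in selected))

def squared_size(selected):
    """A deliberately non-submodular function, used as a negative control."""
    return len(selected) ** 2

log_term_ok = is_submodular(node_term, weights)
control_ok = is_submodular(squared_size, weights)
```

The exhaustive check is exponential in the ground set size, so it is only useful as a unit test on a handful of edges.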



We now optimize $F_C(G)$ by using the greedy algorithm, a well-known efficient heuristic with provable performance guarantees. The algorithm starts with an empty graph $K$ and sequentially adds the edges that maximize the marginal gain. That is, at iteration $i$ we choose the edge $e_i = \arg\max_{e \in G \setminus G_{i-1}} F_C(G_{i-1} \cup \{e\}) - F_C(G_{i-1})$. 

The algorithm stops once it has selected $k$ edges, and returns the solution $\hat{G} = \{e_1, \ldots, e_k\}$. The greedy algorithm is guaranteed to find a set $\hat{G}$ which achieves at least a constant fraction $(1 - 1/e)$ of the optimal value achievable using $k$ edges (Nemhauser et al., 1978). Starting from the near-optimal solution given by the greedy algorithm, we could possibly improve the solution by applying a local search procedure. 
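A compact sketch of the greedy loop is given below. It is not the authors' C++ implementation: the names, the exponential likelihood, the value of `eps` and the two toy cascades are assumptions made for illustration. The marginal gain of an edge $(j, i)$ is computed locally, since adding it only changes node $i$'s term in each cascade where $j$ was infected before $i$.

```python
import math

def f_exp(t_child, t_parent, alpha=1.0):
    """Unnormalized exponential transmission likelihood (illustrative choice)."""
    return math.exp(-alpha * (t_child - t_parent))

def marginal_gain(cascades, graph, edge, eps):
    """Increase of the objective when edge (j, i) is added to graph."""
    j, i = edge
    gain = 0.0
    for times in cascades:
        if j in times and i in times and times[j] < times[i]:
            # 1.0 is the normalized eps-edge from the external source.
            base = 1.0 + sum(f_exp(times[i], times[m]) / eps
                             for m in times
                             if times[m] < times[i] and (m, i) in graph)
            gain += math.log(base + f_exp(times[i], times[j]) / eps) - math.log(base)
    return gain

def greedy_infer(cascades, k, eps=1e-3):
    """Add up to k edges, always taking the one with the largest marginal gain."""
    nodes = sorted({n for times in cascades for n in times})
    candidates = [(j, i) for j in nodes for i in nodes if j != i]
    graph, order = set(), []
    while len(order) < k:
        gains = {e: marginal_gain(cascades, graph, e, eps)
                 for e in candidates if e not in graph}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0.0:
            break  # exhausted: no remaining edge improves the objective
        graph.add(best)
        order.append(best)
    return order

# Two toy cascades that are consistent with a chain a -> b -> c.
cascades = [{"a": 0.0, "b": 1.0, "c": 1.8}, {"a": 0.0, "b": 0.7, "c": 1.9}]
top_two = greedy_infer(cascades, 2)
```

With `k = 2` the loop recovers the two chain edges; asking for more edges than have positive gain makes it stop early rather than add edges that run against the observed time order.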

As in the original NetInf formulation, our algorithm also allows for two speed-ups: localized updates and lazy evaluation (Algorithm 1). We can also obtain an on-line bound based simply on the submodularity of the objective function (Leskovec et al., 2007). 
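Lazy evaluation can be illustrated on any submodular objective. The sketch below is generic and hedged: the function `lazy_greedy`, its bookkeeping and the set-coverage objective are stand-ins chosen for brevity, not the paper's code or its objective. Stale gains stored in a max-heap are valid upper bounds because marginal gains can only shrink as the solution grows, so an element whose freshly recomputed gain is still on top of the heap can be selected without re-evaluating the rest.

```python
import heapq

def lazy_greedy(candidates, gain, k):
    """Greedy selection that re-evaluates a marginal gain only when it might matter."""
    selected = []
    # Heap entries: (-gain bound, tie-breaker, element, solution size when computed).
    heap = [(-gain(selected, e), idx, e, 0) for idx, e in enumerate(candidates)]
    heapq.heapify(heap)
    evaluations = len(heap)
    while heap and len(selected) < k:
        neg_bound, idx, e, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            if -neg_bound <= 0:
                break  # the best fresh gain is not positive, so nothing else helps
            selected.append(e)
        else:
            # Stale bound: recompute against the current solution and push back.
            heapq.heappush(heap, (-gain(selected, e), idx, e, len(selected)))
            evaluations += 1
    return selected, evaluations

# Stand-in submodular objective: number of newly covered items.
sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6, 7}, "D": {1}}

def coverage_gain(selected, e):
    covered = set().union(*(sets[s] for s in selected)) if selected else set()
    return len(sets[e] - covered)

chosen, evaluations = lazy_greedy(list(sets), coverage_gain, 2)
```

On this example the lazy variant performs 5 gain evaluations where a plain greedy pass would perform 7 (four in the first round and three in the second); the saving typically grows with the number of candidates.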

Figure 3. Gain in Area Under the ROC curve (AUC) of our method compared to NetInf vs. number of cascades for (a) a random Kronecker network, (b) a hierarchical Kronecker network and (c) a core-periphery Kronecker network with 1,024 nodes and 1,024 edges for all three transmission models. Our method is able to infer a network more accurately for a small number of cascades and it exhibits similar performance to NetInf for a larger number of cascades. 



4. Experimental evaluation 

We evaluate our network inference algorithm on both synthetic and real networks. We use synthetic networks that aim to mimic the structure of social networks, and real information networks that are based on the MemeTracker dataset¹. We compare our method in terms of precision, recall, accuracy and scalability with several state-of-the-art algorithms: NetInf, ConNIe and NetRate. For the comparisons, we use the public domain implementations of these algorithms. 

4.1. Experiments on synthetic data 

Experimental setup. We first generate synthetic networks using two different well-known models of social networks: the Forest Fire (scale free) model (Barabási & Albert, 1999) and the Kronecker model (Leskovec et al., 2010), and set the pairwise transmission rates of the edges of the networks by drawing samples from $\alpha \sim U(0.5, 1.5)$. We then simulate and record a relatively small set of propagating cascades over each network using three different pairwise transmission likelihoods: exponential, power-law and Rayleigh. There are several reasons why we consider a small set of cascades in comparison to the network size. First, all methods (including ours) assume that cascades propagate over a fixed network. Since social networks are highly dynamic (Backstrom & Leskovec, 2011), changing and growing rapidly, we can only expect to record a small number of cascades over a fixed network. Second, tracking and recording cascades is a difficult and expensive process (Leskovec et al., 2009). Therefore, it is desirable to develop network inference methods that work well with a small number of observed cascades. 

Accuracy. We compare the inferred and true networks via three measures: precision, recall and accuracy. Precision is the fraction of edges in the inferred network $\hat{G}$ present in the true network $G^*$. Recall is the fraction of edges of the true network $G^*$ present in the inferred network $\hat{G}$. Accuracy is 

$$1 - \frac{\sum_{i,j} |I(\alpha^*_{i,j}) - I(\hat{\alpha}_{i,j})|}{\sum_{i,j} I(\alpha^*_{i,j}) + \sum_{i,j} I(\hat{\alpha}_{i,j})},$$

where $I(\alpha) = 1$ if $\alpha > 0$ and $I(\alpha) = 0$ otherwise. Inferred networks with no edges or only false edges have zero accuracy. 
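For edge sets, the three measures can be computed as sketched below. This assumes that accuracy reads as one minus the number of mismatched edges divided by the total number of true and inferred edges, an interpretation that reproduces the stated property that empty or all-false networks score zero. The function name and the convention of returning zero for empty denominators are assumptions.

```python
def edge_metrics(inferred, true):
    """Return (precision, recall, accuracy) of an inferred edge set against the truth."""
    inferred, true = set(inferred), set(true)
    hits = len(inferred & true)
    precision = hits / len(inferred) if inferred else 0.0
    recall = hits / len(true) if true else 0.0
    total = len(inferred) + len(true)
    # The symmetric difference counts false positives plus false negatives.
    accuracy = 1.0 - len(inferred ^ true) / total if total else 0.0
    return precision, recall, accuracy

true_edges = {("a", "b"), ("b", "c"), ("c", "d")}
partly_right = edge_metrics({("a", "b"), ("b", "c"), ("a", "d")}, true_edges)
no_edges = edge_metrics(set(), true_edges)
only_false = edge_metrics({("d", "a")}, true_edges)
```

Sweeping the edge budget `k` and recording these values for each prefix of the greedy solution yields one point per `k` on a precision-recall curve.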

Figure 2 compares the precision, recall and accuracy of our method with those of NetInf, ConNIe and NetRate for three different 1,024 node Kronecker networks: a random network (Erdős & Rényi, 1960) (parameter matrix [0.5, 0.5; 0.5, 0.5]), a hierarchical network (Clauset et al., 2008) ([0.9, 0.1; 0.1, 0.9]) and a core-periphery network (Leskovec et al., 2008) ([0.9, 0.5; 0.5, 0.3]), and 200 observed cascades. In terms of precision-recall, our method is able to reach higher recall values than NetInf, ConNIe and NetRate, i.e., it is able to discover more true edges from a small number of cascades than the other methods. For recall values that are reachable using NetInf, our method and NetInf offer very similar precision. Our method allows for higher recall than NetInf because it gets exhausted² later, since it considers all possible trees per cascade instead of only the most probable one. In terms of accuracy, our method outperforms NetInf for more than half of its output solutions, and matches it on the remaining ones. The accuracy of ConNIe and NetRate is typically significantly lower. However, NetRate is able to beat all other methods for the hierarchical Kronecker network. Compared with previous studies (Myers & Leskovec, 2010), the performance of ConNIe seems to degrade the most, due to the limited availability of cascades and perhaps the variable transmission rates across the networks (as reported previously in Gomez-Rodriguez et al. (2011)). 



1 Data available at http://memetracker.org/ 

2 A greedy method (ours and NetInf) gets exhausted at iteration $k$ when there are no more edges with marginal gain larger than zero. 

Figure 4. Average running time per edge added against number of cascades. We used a 1,024 node random Kronecker network with an exponential transmission model. 

Performance vs. cascade coverage. Intuitively, the more cascades we observe, the more accurately any algorithm infers a network. Indeed, when the number of cascades is large in comparison to the network size, we expect differences in performance among methods to become negligible. Figure 3 plots the gain in Area Under the ROC curve (AUC) for our method in comparison with NetInf, $(\mathrm{AUC}_{\text{our method}} - \mathrm{AUC}_{\text{NetInf}}) / \mathrm{AUC}_{\text{NetInf}}$, against the number of observed cascades for several Kronecker networks and transmission models ($\beta = 0.5$ and $\alpha \sim U(0.5, 1.5)$ in all models). We observe that the difference in performance between our method and NetInf is greatest for a small number of cascades; for a large enough number of cascades, both methods perform similarly or NetInf slightly outperforms our method. 

Scalability. Figure 4 plots the average computation time per edge added against the number of cascades. Since NetRate is not greedy and instead solves a convex program for each node in the network, we divided its total running time by the number of edges that our method added until getting exhausted (until no edge has marginal gain greater than zero). We used the publicly available implementations of our algorithm and NetInf, both coded in C++. To carry out a fair comparison with NetRate, we developed a projected full gradient descent C++ implementation of NetRate, which is considerably faster than the publicly available Matlab implementation (that uses the CVX convex solver), and we ran 10 and 20 iterations of full gradient descent (remarkably, even running one single iteration was slower than NetInf and our method). We do not report running times for ConNIe since the publicly available code is a Matlab implementation (that uses the SNOPT solver) and is probably slower than a C++ implementation. Our method and NetInf are approximately one order of magnitude faster than NetRate. Finally, note that the running time of our algorithm does not depend on the network size but on the number of cascades and the cascade size. As an experimental validation, we ran our algorithm on two networks with 100,000 and 200,000 nodes and an average of two edges per node using 10,000 cascades, and our algorithm took only 10.12 ms and 12.14 ms per edge added, respectively. 



Figure 5. Real data. Panel (a) plots precision-recall and panel (b) 
accuracy on a 1,000 node hyperlink network with 10,000 edges 
using 1,000 cascades and a power-law model. To control the so- 
lution sparsity or precision-recall tradeoff, we sweep over k (num- 
ber of edges) in our method and NetInf and over p (penalty fac- 
tor) in ConNIe. Our method beats others for the majority of their 
outputted solutions. 



4.2. Experiments on real data 

Experimental setup. We use the publicly available MemeTracker dataset, which contains more than 172 million news articles and blog posts from 1 million online sources (Leskovec et al., 2009). Sites publish pieces of information and use hyperlinks to refer to their sources, which are other sites that published the same or closely related pieces of information. Therefore, we use hyperlinks to trace information propagation over blogs and media sites. A hyperlink cascade is simply a collection of time-stamped hyperlinks between sites (in blog or news media posts) that refer to the same or closely related pieces of information. We record one hyperlink cascade per piece or closely related pieces of information. We extract the top 1,000 media sites and blogs with the largest number of documents, 10,000 hyperlinks and the 500 longest hyperlink cascades. We create a ground truth network $G^*$ which contains an edge between a site $u$ and a site $v$ if there is at least one post on site $u$ that links to a post on site $v$. We then infer a network $\hat{G}$ from the hyperlink cascades and evaluate precision, recall and accuracy with respect to $G^*$. We consider a power-law pairwise transmission likelihood. Note that we trace the flow of information and create a ground truth network using hyperlinks because we are interested in a quantitative evaluation of our method in comparison with the state of the art. For richer qualitative insights, cascades based on short textual phrases should be considered, but that goes beyond the scope of this paper. 

Accuracy. Figure 5 shows precision, recall and accuracy of our method in comparison with NetInf, ConNIe and NetRate. Our method reaches higher recall values than any other method. In terms of accuracy, it beats the others for the majority of their output solutions. As in the synthetic experiments, the shortage of recorded cascades degrades ConNIe's performance dramatically. 

5. Conclusions 

We have developed an efficient approximation algorithm with provable near-optimal performance that solves an open problem on network inference from diffusion traces (or cascades) first introduced by Gomez-Rodriguez et al. (2010). In our work, for each observed cascade we consider all possible ways in which a diffusion process spreading over the network can create the cascade, in contrast with NetInf, which considers only the most probable way (tree). 

Perhaps surprisingly, despite considering all trees, we show experimentally that the running times of our method and NetInf are similar, and that both are several orders of magnitude faster than alternative network inference methods based on convex programming, such as NetRate and ConNIe. Moreover, our algorithm typically outperforms NetInf, NetRate and ConNIe in terms of precision, recall and accuracy in highly dynamic networks in which we only observe a relatively small set of cascades before they change significantly. 